This study examines the geographic distribution of house prices across London. The data is inspected, cleaned, and scaled, then analysed with a k nearest neighbour regression, and the resulting predictions are plotted on a 3D surface to visually show areas of high prices and the general spatial pattern of prices across London.
Below is a simple helper function that summarises the result of cross validating the model.
def print_summary(opt_reg_object):
    #Extract the parameters of the best estimator found by the grid search
    params = opt_reg_object.best_estimator_.get_params()
    #Scores are negated (greater is better), so flip the sign back to an MAE
    score = -opt_reg_object.best_score_
    print("Nearest neighbours: %8d" % params['n_neighbors'])
    print("Minkowski p       : %8d" % params['p'])
    print("Weighting         : %8s" % params['weights'])
    print("MAE Score         : %8.2f" % score)
#Loading dependencies
import numpy as np
import scipy as sp
import pandas as pd
import sklearn as sk
import sklearn.metrics          #needed so sk.metrics is available below
import sklearn.model_selection  #needed so sk.model_selection is available below
import matplotlib.pyplot as plot
from sklearn.neighbors import KNeighborsRegressor as NN
from sklearn.preprocessing import StandardScaler as SS
from mpl_toolkits.mplot3d import Axes3D
The first step was to check whether the data is suitable for machine learning analysis. This involves inspecting the variables and the scaling and units of each one to make sure they are appropriate. Since we are applying a k nearest neighbour analysis we need to be especially careful with scaling, as any algorithm that uses distance measurements in its calculations can be biased or skewed when the variables are not scaled appropriately.
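The scaling issue can be made concrete with a toy pair of observations (hypothetical values, not taken from the dataset): coordinate differences are in the hundreds or thousands of metres while floor-area differences are in the tens, so the raw Euclidean distance is dominated almost entirely by location until the columns are z-scored.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical observations: [easting, northing, floor area]
raw = np.array([[530000.0, 180000.0, 70.0],
                [531000.0, 180500.0, 150.0]])

# Raw Euclidean distance: dominated by the coordinate columns
raw_dist = np.linalg.norm(raw[0] - raw[1])

# After z-scoring each column, all variables contribute comparably
sc = StandardScaler().fit(raw)
scaled = sc.transform(raw)
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])

print(raw_dist, scaled_dist)
```

Here the floor-area difference contributes well under 1% of the raw distance, but an equal share of the scaled one.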
The dataset has 5 variables: a primary key ID, east and north coordinates for location, a price per observation, and a floor area in square metres. There are 1405 observations in this dataset.
#Loading data
hp = pd.read_csv('hpdemo.csv', dtype = float)
hp
#Scaling the data
#Initialise the scaling object and run the preliminary calculation
scaler = SS()
scaler.fit(hp[['east','north','fl_area']]) #This only computes the mean and std to be used for scaling later
#Apply the scaling to the data
hp_sc = scaler.transform(hp[['east','north','fl_area']])
Once the data has been inspected and appropriately cleaned and scaled we can begin the actual analysis and interpretation. Here we will be using k nearest neighbour regression to predict house prices from the remaining variables. Since the algorithm is distance based and the variables are in different units of measurement, we scale the variables to their z scores to make the data unitless and thus comparable.
K nearest neighbour is a supervised learning method that predicts the value at a query point from the k nearest training observations, either averaging them equally or weighting them by distance. The algorithm is influenced by the distribution of the variables, so we will have to look at that distribution and check whether it is problematic for the model. The house prices plotted at the end of this analysis show a high degree of variability and local spikes, so distance weighting from the query point, or a self organising map, may be an appropriate way to improve the model in the next analysis of this data.
The main variable being tested is price in pounds sterling. This variable is technically discrete, but due to its abstract nature it can be, and commonly is, treated as continuous in trading and financial analyses, and we will treat it as continuous here. Given that the data frame is low dimensional (only 4 variables plus the ID), Euclidean distance is an appropriate distance measure for this analysis, although the Minkowski distance with p = 1 will also be tested.
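For illustration, the two Minkowski settings that will be compared reduce to the Manhattan distance (p = 1) and the Euclidean distance (p = 2); a toy example with made-up vectors:

```python
import numpy as np
from scipy.spatial.distance import minkowski

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# p=1: Manhattan distance, the sum of absolute differences
d1 = minkowski(a, b, p=1)
# p=2: Euclidean distance, the straight-line distance
d2 = minkowski(a, b, p=2)

print(d1, d2)  # 7.0 and 5.0
```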
A cross validation will also be done in order to assess the models produced, tune parameters, and improve the fit of the regression model. The mean absolute error (MAE) will guide this process: we are looking to minimise this error term while not overfitting the data.
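As a minimal sketch of this workflow on synthetic data (not the house-price dataset), note that scikit-learn scorers follow a greater-is-better convention, so the MAE comes back negated and has to be sign-flipped for reporting:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=200)  # noisy linear response

# 5-fold cross validation; scores are negated MAE values
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y,
                         scoring='neg_mean_absolute_error', cv=5)
mae = -scores.mean()
print(mae)
```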
#Creating the regression model object with arbitrary parameters
mod1 = NN(n_neighbors=6,weights='uniform',p=2)
#Fitting the regressors and response variable to model
price = hp['price']/1000.0 #separate the response from the regressors and rescale to thousands of pounds
mod1.fit(hp_sc,price)
Below are the optimised parameters for the KNN model, as calculated via the best estimator found by the grid search in the sklearn package. The model has an MAE of 43.65 (in thousands of pounds), which is quite small relative to the scale of the response variable, so this model should produce good results.
#Assessing model performance
#Initialise the scoring object (sklearn scorers are greater-is-better, so the MAE is negated)
mae = sk.metrics.make_scorer(sk.metrics.mean_absolute_error, greater_is_better=False)
#Create a grid searcher to iterate through all parameter combinations of the model
mod_list = sk.model_selection.GridSearchCV(estimator=NN(),scoring=mae,param_grid= {'n_neighbors':range(1,35),
'weights':['uniform','distance'],
'p':[1,2]})
#Fit the data to the gridsearcher
mod_list.fit(hp[['east','north','fl_area']],price) #fitted on unscaled features so the raw coordinate meshes below can be passed straight to predict
#Show parameters of the best model found
print_summary(mod_list)
Now that a model has been created and an optimised set of parameters has been found, we can plot the predictions that the optimised model makes.
The plot we will use is a 3D surface plot. Two of the regressors (eastings and northings) are plotted on the x and y axes to create a surface that represents the geographic area like a map, and the z axis shows the predicted price with the final regressor, floor area, held fixed.
First we have to prepare and shape the data. We will need three 2D arrays, one for each axis. We then vary the fixed floor area value to create three different plots:
One for house prices predicted at the average floor area of London houses.
A second for house prices predicted for a 75 square metre house.
A third for house prices predicted for a 125 square metre house.
#Prepping data for plotting
#Creating the x and y axis meshes
east_mesh, north_mesh = np.meshgrid(np.linspace(505000, 555800, 100),
np.linspace(158400, 199900, 100))
#Create empty floor-area meshes that are the same shape as the x and y meshes
fl_mesh = np.zeros_like(east_mesh)
fl_mesh2 = np.zeros_like(east_mesh)
fl_mesh3 = np.zeros_like(east_mesh)
#Fill every cell with the average floor area of the dataset, then with two other floor sizes
fl_mesh[:,:] = np.mean(hp['fl_area'])
fl_mesh2[:,:] = 75
fl_mesh3[:,:] = 125
#Prepping the data for predictions
#Need to unravel each 2d array into a 1d vector for the predict function
regressor_df = np.array([np.ravel(east_mesh),np.ravel(north_mesh),np.ravel(fl_mesh)]).T
regressor_df2 = np.array([np.ravel(east_mesh),np.ravel(north_mesh),np.ravel(fl_mesh2)]).T
regressor_df3 = np.array([np.ravel(east_mesh),np.ravel(north_mesh),np.ravel(fl_mesh3)]).T
#Make predictions (the first uses the dataset's mean floor area for every grid cell)
hp_pred = mod_list.predict(regressor_df)
hp_pred2 = mod_list.predict(regressor_df2)
hp_pred3 = mod_list.predict(regressor_df3)
#Shape the 1d vector of predictions into 2d array for z-axis of the plot
hp_mesh = hp_pred.reshape(east_mesh.shape)
hp_mesh2 = hp_pred2.reshape(east_mesh.shape)
hp_mesh3 = hp_pred3.reshape(east_mesh.shape)
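The ravel-and-reshape round trip relies on both operations using the same row-major ordering, so predictions made on the flattened grid fold back into the correct mesh cells. A small self-contained check with a toy grid:

```python
import numpy as np

xm, ym = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 3))

# Flatten each mesh into one column of an (n_points, 2) feature array
pts = np.array([np.ravel(xm), np.ravel(ym)]).T   # shape (12, 2)

# A stand-in "prediction" per point, then fold it back to the mesh shape
pred = pts[:, 0] + pts[:, 1]
zm = pred.reshape(xm.shape)                      # shape (3, 4)

# The round trip preserves the cell-by-cell correspondence
assert np.allclose(zm, xm + ym)
```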
#Plot1
fig = plot.figure()
ax = fig.add_subplot(projection='3d') #Axes3D(fig) is deprecated in newer matplotlib
ax.plot_surface(east_mesh, north_mesh, hp_mesh, rstride=1, cstride=1, cmap='YlOrBr', lw=0.01)
plot.title('London House Prices')
ax.set_xlabel('Easting')
ax.set_ylabel('Northing')
ax.set_zlabel('Price at Mean Floor Area (£1000s)')
plot.show()
#Plot2
fig = plot.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(east_mesh, north_mesh, hp_mesh2, rstride=1, cstride=1, cmap='YlOrBr', lw=0.01)
plot.title('London House Prices')
ax.set_xlabel('Easting')
ax.set_ylabel('Northing')
ax.set_zlabel('Price at 75 m² Floor Area (£1000s)')
plot.show()
#Plot3
fig = plot.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(east_mesh, north_mesh, hp_mesh3, rstride=1, cstride=1, cmap='YlOrBr', lw=0.01)
plot.title('London House Prices')
ax.set_xlabel('Easting')
ax.set_ylabel('Northing')
ax.set_zlabel('Price at 125 m² Floor Area (£1000s)')
plot.show()
Looking at the plots we can see a great deal of variation in house prices across the London area. The main price peaks are in the city centre, with a prominent dip in prices in the east end of the city.
From the three plots we can see that changing the floor area does not dramatically alter the spatial distribution of house prices, so we can conclude it does not have a large effect within this relatively tight range of floor areas (75 to 125 square metres). The biggest predictor of house prices appears to be location, which makes intuitive sense.
There are also a few notable spikes that are outliers relative to their local surroundings; one spike is at the very south of the city, around 520,000 easting. Taking this spike and analysing it and its local area may be a good idea.
A principal component analysis would be an appropriate method to test how much of the total variability in house prices the eastings and northings variables account for.
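A hedged sketch of what that follow-up could look like, using synthetic stand-in columns rather than the actual dataset; scikit-learn's PCA exposes the fraction of variance captured by each component through explained_variance_ratio_:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two correlated "coordinate-like" columns and an independent third column
east = rng.normal(0, 1, 500)
north = 0.8 * east + rng.normal(0, 0.5, 500)
fl = rng.normal(0, 1, 500)
X = StandardScaler().fit_transform(np.column_stack([east, north, fl]))

pca = PCA(n_components=3).fit(X)
# Fraction of total variability captured by each component, in decreasing order
print(pca.explained_variance_ratio_)
```

On the real data, the same ratios would indicate how much of the overall structure the location variables carry.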